San Francisco is a place many people dream of working and living in. But with rising wealth inequality and housing shortages, there is no shortage of crime in the city by the bay. This raises a few questions: What types of crimes happen most often, and where and when do they happen? How does public safety vary from place to place? In this tutorial, we will analyze crime data from 1/1/2003 to 5/13/2015 to answer these questions, so that you come away with a better idea of San Francisco's crime activity and public safety.
In this tutorial, we will first prepare, analyze, and visualize the crime data. Then we will interpret the data and draw conclusions.
We will be using Python 3 in a Jupyter Notebook along with a few imported libraries: pandas, numpy, matplotlib, scikit-learn, seaborn, and folium. Below are the imports and the abbreviations we use throughout this tutorial (the scikit-learn modules are imported later, where they are used).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
Thanks to Kaggle and DataSF, the data comes prepared as a CSV (Comma-Separated Values) file. We can read CSV files using pandas, with ',' as the separator. Some CSV files are separated by ';' or ' ' (space) instead; simply changing the 'sep' parameter handles those. pandas can also read Excel files or plain text files. If you want to parse other types of data files, you can read more from here.
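For example, reading a semicolon-separated file only requires changing sep. Here is a small sketch using a hypothetical in-memory file (the column names are made up for illustration):

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data
raw = "Category;Year\nASSAULT;2003\nBURGLARY;2004\n"

# Same read_csv call, just a different separator
df_semi = pd.read_csv(io.StringIO(raw), sep=';')
```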
df = pd.read_csv("crime_san_francisco.csv", sep = ',')
df.head()
In the above processed data frame, there are 9 columns:
It is very rare that our given data is ready for analysis without any modification.
Then let's take a look at our dataframe size:
df.size
Whoa! Our data frame has a size of around 8 million. Note that .size counts every cell (rows times columns), so the row count is this number divided by our 9 columns. Still, that's a lot of data; more than we need for this analysis, and more than our machines can comfortably handle (especially for visualization later).
For our analysis, what we want is a portion of the population, i.e., a sample.
Be careful: the sample should be random and not too small, because we still want it to correctly represent the underlying population.
The good news is that the pandas DataFrame has a <a href = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sample.html">built-in sample function</a> that draws a random sample without introducing bias.
Here, we sample a fraction of one-tenth of the population, which still leaves us a feasible yet decent size to analyze.
df = df.sample(frac=0.1)
df.size
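One note: if you want the sample (and therefore all downstream numbers) to be reproducible across runs, sample() also accepts a random_state seed. A minimal sketch on a hypothetical 100-row frame:

```python
import pandas as pd

# Hypothetical stand-in for our crime dataframe
df_demo = pd.DataFrame({'a': range(100)})

# With a fixed seed, the exact same 10% sample is drawn every run
s1 = df_demo.sample(frac=0.1, random_state=42)
s2 = df_demo.sample(frac=0.1, random_state=42)
```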
Now we have an appropriately sized dataframe to work with. However, dataframes often contain information irrelevant to our topic and analysis. In our case, since we want to analyze criminal activity, these few crime categories are not related to our analysis:
Notice this step is open-ended and requires common sense. I think the NON-CRIMINAL, RECOVERED VEHICLE, SUICIDE, and SUSPICIOUS OCC data would not help us understand San Francisco's public safety conditions. Also, OTHER OFFENSES and WARRANTS are not informative enough for us to interpret and draw conclusions from.
Thus, we will remove the rows with those categories from our dataframe:
df = df[(df.Category != 'NON-CRIMINAL') & (df.Category != 'OTHER OFFENSES') &
(df.Category != 'RECOVERED VEHICLE') & (df.Category != 'SUICIDE') &
(df.Category != 'SUSPICIOUS OCC') & (df.Category != 'WARRANTS') ]
df.head()
Some columns are also irrelevant, so we drop them:
df = df.drop(columns=['Descript', 'PdDistrict', 'Resolution', 'Address'])
display(df.head())
Great! Now the dataframe has no irrelevant columns. However, there is one more thing to do. Observe how the dates are represented here: year-month-day, then hours:minutes:seconds. This entire chunk of data sits inside a single cell, which is not easy to work with. We want to separate it into five columns: Year, Month, Day, Hour, and Minute:
# A typical Dates here would look like : 2011-08-13 19:15:00
df['Year'] = ""
df['Month'] = ""
df['Day'] = ""
df['Hour'] = ""
df['Minute'] = ""
for index, row in df.iterrows():
    (dates, time) = row['Dates'].split(' ')
    (year, month, day) = dates.split('-')
    (hour, minute, second) = time.split(':')
    df.at[index, 'Year'] = year
    df.at[index, 'Month'] = month
    df.at[index, 'Day'] = day
    df.at[index, 'Hour'] = hour
    df.at[index, 'Minute'] = minute
df.head()
First, df['column'] = "" creates an empty column in our dataframe. Then we step through each row of the dataframe by calling <a href = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iterrows.html">df.iterrows()</a>, keeping track of the index and the row object as we go. We then split the row['Dates'] value into dates and time; notice that <a href = "https://www.geeksforgeeks.org/python-string-split/">split()</a> returns a list, which we unpack. We do the same for dates and time, separating them into year, month, day, hour, and minute, then put them into the previously created empty columns. <a href = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.at.html">df.at</a> sets a specific value (modifying the dataframe in place) in the cell at [row, column].
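As a side note, the same split can be done without an explicit loop using pandas' vectorized datetime tools. Here is a minimal sketch on a hypothetical two-row frame (note that the .dt accessors return integers, whereas the loop above stores strings):

```python
import pandas as pd

# Hypothetical miniature version of the Dates column
demo = pd.DataFrame({'Dates': ['2011-08-13 19:15:00', '2003-01-01 00:05:00']})

parsed = pd.to_datetime(demo['Dates'])  # parse the whole column at once
demo['Year'] = parsed.dt.year
demo['Month'] = parsed.dt.month
demo['Day'] = parsed.dt.day
demo['Hour'] = parsed.dt.hour
demo['Minute'] = parsed.dt.minute
```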
Perfect! Compared with the previous result, our dataframe now has Year, Month, Day, Hour, and Minute columns that track these date components separately. Now we may begin our exploratory data analysis and data visualization.
The central theme of our analysis revolves around four variables:
By the end of this analysis, we should know when, where, and what kinds of crimes happen often, and how they change over time. Then we can interpret the results and make hypothetical policy decisions based on this analysis.
Let's first take a look at each crime type's frequency. <a href = "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html">value_counts()</a> returns each crime category's frequency.
freq_count = df['Category'].value_counts()
print(freq_count)
Now we have a clear picture of what the top crimes are. Let's focus on the top 6 crimes, since they make up 83% of the total. value_counts() returns a <a href = "https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.html">pandas Series</a>, an object type that supports list-style indexing. If we only want the top 6 crimes, we can use ':', the "slice" operator, which is very useful and easy to work with. <a href = "https://www.pythoncentral.io/how-to-slice-listsarrays-and-tuples-in-python/">Here</a> is more information about Python's slice operator.
top6_crime = freq_count[0:6]
print(top6_crime)
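Since a Series supports positional slicing, freq_count[0:6] is equivalent to a couple of other pandas idioms. A quick sketch on made-up counts (already sorted descending, as value_counts() returns them):

```python
import pandas as pd

# Made-up frequency counts standing in for freq_count
counts = pd.Series([40, 30, 20, 10, 5, 4, 3],
                   index=['A', 'B', 'C', 'D', 'E', 'F', 'G'])

top = counts[0:6]        # positional slice, as used above
same = counts.head(6)    # equivalent idiom
```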
Let's make a new dataframe that only contains these types of crimes. Notice we use the logical OR operator '|' this time.
top6_df = df[(df.Category == 'LARCENY/THEFT') | (df.Category == 'ASSAULT') |
(df.Category == 'DRUG/NARCOTIC') | (df.Category == 'VEHICLE THEFT') |
(df.Category == 'VANDALISM') | (df.Category == 'BURGLARY') ]
top6_df.head()
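As an aside, the chain of '|' comparisons above can be written more compactly with pandas' isin() method. Here is a sketch on a hypothetical miniature frame:

```python
import pandas as pd

# Hypothetical miniature frame standing in for df
df_demo = pd.DataFrame({'Category': ['LARCENY/THEFT', 'ARSON',
                                     'ASSAULT', 'BURGLARY']})

top6 = ['LARCENY/THEFT', 'ASSAULT', 'DRUG/NARCOTIC',
        'VEHICLE THEFT', 'VANDALISM', 'BURGLARY']

# Keep only rows whose Category is in the top-6 list
filtered = df_demo[df_demo['Category'].isin(top6)]
```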
To help us visualize, we can make a pie chart that labels each of the 6 categories and their frequency in percentages. I am using the default pandas dataframe <a href = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html">'.plot'</a> function. The "autopct" parameter lets us display the percentage value using Python string formatting. If you want to know more about it, <a href = "https://stackoverflow.com/questions/6170246/how-do-i-use-matplotlib-autopct">here</a> is a really helpful answer on Stack Overflow.
pie_chart = top6_df['Category'].value_counts().plot(kind='pie', title = "Top 6 crime percentage", autopct='%1.1f%%', radius = 1.1)
pie_chart.set_ylabel('')
pie_chart
LARCENY/THEFT has the highest percentage (40%) among all the crimes. Therefore, we will focus on it in the machine learning and hypothesis testing parts of this tutorial and see if we can grasp its pattern across time and location. Then we can come up with some solutions to address the LARCENY/THEFT problem.
We may now begin our analysis. This section is dedicated to analyzing the time factor of the above crime categories. First, let's graph these top 6 crimes by their changes over the years.
count_crime = top6_df.groupby(['Year', 'Category']).size()
top6_year_df = count_crime.to_frame(name = "Freq").reset_index()
display(top6_year_df.head())
This part is a bit tricky. If we want to analyze how crime frequency changes over time, we need each top-6 crime's frequency in each year. So we call the <a href = "https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html">groupby()</a> function to group by two variables: Year and Category. It returns every combination of year and top-6 crime. How do we count the occurrences? The size of each crime group within each year gives the occurrence count. We then convert it to a pandas dataframe using the .to_frame function and reset the index. We can now graph it using <a href = "https://seaborn.pydata.org/generated/seaborn.lineplot.html">seaborn.lineplot</a>.
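The same two steps can also be collapsed into a single chain with reset_index(name=...). A small sketch on a hypothetical miniature crime log:

```python
import pandas as pd

# Hypothetical miniature crime log
demo = pd.DataFrame({
    'Year': ['2003', '2003', '2003', '2004'],
    'Category': ['ASSAULT', 'ASSAULT', 'BURGLARY', 'ASSAULT'],
})

# groupby + size + name the count column, all in one chain
counts = demo.groupby(['Year', 'Category']).size().reset_index(name='Freq')
```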
plt.subplots(figsize=(12,8)) # change our graph dimension
ax = sns.lineplot(x="Year", y="Freq", hue = "Category", data= top6_year_df).set_title("Crime Frequency vs. Year")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.) # change our legend position
plt.show()
Now we have the Crime Frequency vs. Year graph for our top 6 crimes. We can see LARCENY/THEFT has consistently been the most common crime. VEHICLE THEFT starts off very high but dropped suddenly in 2006. The most recent data, 2014-2015, shows a significant decrease across all crimes (keep in mind the 2015 data only runs through May 13). This is interesting, and we will model its regression later. Let's do the same for Month, Day, and Hour:
# Groupby Month and Category
count_crime = top6_df.groupby(['Month', 'Category']).size()
top6_month_df = count_crime.to_frame(name = "Freq").reset_index()
plt.subplots(figsize=(12,8))
ax = sns.lineplot(x="Month", y="Freq", hue = "Category", data= top6_month_df).set_title("Total Crime Frequency vs. Month")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
# Groupby Day and Category
count_crime = top6_df.groupby(['Day', 'Category']).size()
top6_day_df = count_crime.to_frame(name = "Freq").reset_index()
# We drop the 31st day because fewer months have a 31st day,
# which would bias the accumulated counts
top6_day_df = top6_day_df[top6_day_df.Day != '31']
plt.subplots(figsize=(12,8))
ax = sns.lineplot(x="Day", y="Freq", hue = "Category", data= top6_day_df).set_title("Total Crime Frequency vs. Day")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
# Groupby Hour and Category
count_crime = top6_df.groupby(['Hour', 'Category']).size()
top6_hour_df = count_crime.to_frame(name = "Freq").reset_index()
plt.subplots(figsize=(12,8))
ax = sns.lineplot(x="Hour", y="Freq", hue = "Category", data= top6_hour_df).set_title("Total Crime Frequency vs. Hour")
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
Some observations: Crime Frequency vs. Hour is particularly interesting. It rises starting at 6 AM, peaks around 6 PM, and falls afterward. This is interesting because many people think being outside late at night is dangerous, but clearly far more crimes happen in the afternoon. Surprisingly, crime rates remain stable when we group them by month and day.
In this part, we want to use modeling to predict future trends from existing data. For this tutorial, we will use linear regression, with the widely popular scikit-learn library. If you want to know more about linear regression and sklearn, <a href = "http://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/">here</a> is a great tutorial.
From the graphs above, LARCENY/THEFT is important to address, since it is volatile and makes up a big portion of all crimes. So let's first look at it by year:
theft_year = top6_year_df[(top6_year_df.Category == 'LARCENY/THEFT')]
display(theft_year.head())
plt.subplots(figsize=(12,8))
plt.xlabel("Year")
plt.ylabel("LARCENY/THEFT Frequency")
plt.title("LARCENY/THEFT vs. Year")
plt.scatter(theft_year.Year, theft_year.Freq)
If the 2015 data were not on the graph, we would think the LARCENY/THEFT rate was still surging, beginning in 2012. A linear regression draws the line that minimizes the sum of squared vertical distances (residuals) between the line and the data points. My prediction is a line that cuts through the middle at about 1300 on the frequency axis and remains relatively stable throughout the years. To graph this linear regression line, we will be using <a href = "https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LinearRegression.html">linear_model.LinearRegression()</a>
from sklearn.linear_model import LinearRegression
theft_year = top6_year_df[(top6_year_df.Category == 'LARCENY/THEFT')]
plt.subplots(figsize=(12,8))
plt.xlabel("Year")
plt.ylabel("LARCENY/THEFT Frequency")
plt.title("LARCENY/THEFT vs. Year with Regression line")
plt.scatter(theft_year.Year, theft_year.Freq)
clf = LinearRegression()
# LinearRegression expects a 2D feature array, so reshape the year list like [[2012], [2013], [2014], ...]
year = np.array(theft_year.Year).reshape(-1, 1)
clf.fit(year, theft_year.Freq)
predicted_freq = clf.predict(year)
plt.plot(theft_year.Year, predicted_freq)
So now we have our regression line, but is our data well represented? In other words, does our data deviate from the regression line a lot? To answer this question, we need the coefficient of determination, or R-squared (R^2), value. This is particularly useful when we try to interpret a scatter plot's regression line.
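For intuition, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean. Here is a minimal sketch on made-up toy numbers, checking that sklearn's .score() matches the formula:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up toy data, roughly linear
X = np.array([[1], [2], [3], [4]])
y = np.array([2.0, 4.1, 5.9, 8.2])

model = LinearRegression().fit(X, y)
pred = model.predict(X)

ss_res = ((y - pred) ** 2).sum()      # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()  # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

sklearn_r2 = model.score(X, y)        # same quantity via sklearn
```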
print(clf.score(year,theft_year.Freq))
This R-squared value is very low; as a rough rule of thumb, we would want it above 0.5. The reason is obvious: starting in 2012 the frequency shot up, then suddenly dropped in 2015, so those data points deviate too much from the regression line. Thus, this regression line is not trustworthy enough for us to interpret and draw conclusions from. If you want to learn more about R-squared, <a href = "http://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit">here</a> is a very good description.
Let's try LARCENY/THEFT vs. Hour and see if its data is better suited for regression analysis. If we want the average LARCENY/THEFT frequency at each hour within a single year, we have to divide the frequency by 13, because top6_hour_df summed the theft frequency at each hour across all 13 years (grouped by Hour and Category).
theft_hour = top6_hour_df[top6_hour_df.Category == 'LARCENY/THEFT'].copy()
theft_hour['Freq'] = theft_hour['Freq'] / 13  # average over the 13 years
plt.subplots(figsize=(12,8))
plt.xlabel("Hours")
plt.ylabel("LARCENY/THEFT Frequency")
plt.title("LARCENY/THEFT vs. Hours with Regression line")
plt.scatter(theft_hour.Hour, theft_hour.Freq)
clf = LinearRegression()
hour = np.array(theft_hour.Hour).reshape(-1, 1)
clf.fit(hour, theft_hour.Freq)
predicted_freq = clf.predict(hour)
plt.plot(theft_hour.Hour, predicted_freq)
plt.show()
print(clf.score(hour,theft_hour.Freq))
This time our R-squared value is good! But can we do better? Observe that, taken together, the points look like a y = x^3 curve running diagonally. What we need is a polynomial regression line with a degree of 3. We can use <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html">PolynomialFeatures</a> from sklearn.preprocessing, which lets us customize the degree of our regression line. So, let's try it out here:
from sklearn.preprocessing import PolynomialFeatures
theft_hour = top6_hour_df[top6_hour_df.Category == 'LARCENY/THEFT'].copy()
theft_hour['Freq'] = theft_hour['Freq'] / 13  # average over the 13 years
plt.subplots(figsize=(12,8))
plt.xlabel("Hours")
plt.ylabel("LARCENY/THEFT Frequency")
plt.title("LARCENY/THEFT vs. Hours with Polynomial Degree of 3 Regression line")
plt.scatter(theft_hour.Hour, theft_hour.Freq)
clf = LinearRegression()
poly_feat = PolynomialFeatures(degree=3)
hour = np.array(theft_hour.Hour).reshape(-1, 1)
print("before poly_transform: ")
for x in range(3):
    print(hour[x])
hour_poly = poly_feat.fit_transform(hour)
print("after poly_transform: ")
for x in range(3):
    print(hour_poly[x])
clf.fit(hour_poly, theft_hour.Freq)
predicted_freq = clf.predict(hour_poly)
plt.plot(theft_hour.Hour, predicted_freq)
plt.show()
print("Our R-Squares is: ")
print(clf.score(hour_poly,theft_hour.Freq))
A little explanation here: we set degree=3 in PolynomialFeatures, then transform our independent variable 'hour' into degree-3 features. I printed out 'hour' and 'hour_poly' for comparison. Notice we now have 4 features in each array, one for each degree 0-3 (the first column is all 1s because anything raised to the power of 0 is 1). The regression then constructs a polynomial that fits these features.
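To make the transform concrete, here is what PolynomialFeatures does to a single made-up input value of 2:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=3)

# Transform the single value 2 into [1, 2, 4, 8]:
# the input raised to powers 0 through 3
features = poly.fit_transform(np.array([[2.0]]))
```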
This time our R-squared score is 0.926, almost perfect (1 being perfect)! This could now be a very reliable model for predicting LARCENY/THEFT frequency. Say we want to know how many LARCENY/THEFT crimes happen at 5 AM; we can just do:
display(clf.predict(poly_feat.fit_transform([[5]])))
display(clf.predict(poly_feat.fit_transform([[18]])))
So according to our model, on average there will be about 16 larceny/theft crimes committed at 5 AM, but about 94 at 6 PM (hour 18)!
In this section we want to analyze the location factor of the top 6 crime categories. I use the folium library here, which I linked at the beginning of the tutorial. We first need to zoom in on San Francisco; a quick Google search gives its coordinates: (37.7749, -122.4194).
crime_map = folium.Map(location=[37.7749, -122.4194], zoom_start=12)
crime_map
Our current top6_df has about 440k rows. It's not visually clear when we try to squeeze 440k points into this tiny area, so we cut it down to about 90k rows to graph here. I am using .sample again and assigning the result to top6_vis_df.
top6_vis_df = top6_df.sample(frac = 0.2)
display(top6_vis_df.head())
display(top6_vis_df.size)
Now we just need to step through our dataframe and plot each crime based on its location. Remember we preserved the X and Y columns? We can use them directly to plot on our map. Each crime category will be assigned a different <a href = "https://www.google.com/search?q=hex+color&rlz=1C5CHFA_enUS760US760&oq=hex+color&aqs=chrome.0.69i59j0l5.862j0j7&sourceid=chrome&ie=UTF-8">hex color code</a>. Here I am using folium.Circle to plot our data; you can also assign a popup value that attaches an extra description to each data point.
# Map each crime category to its own hex color
category_colors = {
    'LARCENY/THEFT': '#262121',
    'BURGLARY':      '#ff0c0c',
    'DRUG/NARCOTIC': '#ffe100',
    'ASSAULT':       '#00ff33',
    'VANDALISM':     '#0061ff',
    'VEHICLE THEFT': '#f702eb',
}
for index, row in top6_vis_df.iterrows():
    folium.Circle(radius=50, location=[row['Y'], row['X']],
                  color=category_colors[row['Category']],
                  popup=row['Category'], fill=True).add_to(crime_map)
crime_map